Efficient and Fault-Tolerant Distributed Host Monitoring Using System-Level Diagnosis
Authors
Abstract
This paper presents an efficient and fault-tolerant distributed approach to monitoring the status of processors in a network. The Distributed System Monitor (DSMon) is a distributed, decentralized program that gathers processor information, such as CPU load, user information, and network and disk statistics, in parallel at each processor and reliably distributes the information on-line to all fault-free processors. Information is filtered at each processor and distributed at different priorities to conserve communication resources. Fault-tolerance is achieved by applying the results of previous system-level diagnosis research. An on-line distributed system-level diagnosis algorithm that assumes the PMC fault model and a fully connected network is extended to consistently maintain user-defined state information in an unreliable environment. DSMon has been implemented and currently operates on approximately 200 networked workstations in the Electrical and Computer Engineering Department at Carnegie Mellon University. The key results of this paper include the extension of a distributed system-level diagnosis algorithm for reliable broadcast of current global state, and the specification of the DSMon. A relaxed form of reliable broadcast, called condensed reliable broadcast, is introduced for guaranteeing delivery of the most recently broadcast update, without guaranteeing a complete history of all broadcast updates. The DSMon implementation is described, and its operation in an actual distributed network environment is analyzed. Extensions to this work include other fault and system models and applicability to other distributed applications requiring consistent distributed global state.
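The condensed reliable broadcast described in the abstract guarantees delivery of the most recent update but not the full history of updates. The following is a minimal sketch of that delivery rule only, not the DSMon implementation: it assumes each update carries a per-source sequence number, and the names `Update`, `Node`, `deliver`, `state_of`, and `demo` are hypothetical and do not come from the paper. The diagnosis algorithm and the PMC fault model are not modeled here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Update:
    source: str    # hypothetical identifier of the broadcasting processor
    seq: int       # assumed per-source sequence number, increasing per broadcast
    payload: dict  # monitored values, e.g. {"cpu_load": 0.42}


@dataclass
class Node:
    name: str
    # Only the newest update per source is stored; no history is kept.
    latest: Dict[str, Update] = field(default_factory=dict)

    def deliver(self, update: Update) -> bool:
        """Accept the update only if it is newer than the stored one."""
        current = self.latest.get(update.source)
        if current is not None and current.seq >= update.seq:
            return False  # stale or duplicate update is dropped
        self.latest[update.source] = update
        return True

    def state_of(self, source: str) -> Optional[dict]:
        """Return the most recent known payload for a source, if any."""
        update = self.latest.get(source)
        return None if update is None else update.payload


def demo() -> Optional[dict]:
    node = Node("b")
    # Updates arrive out of order, and the update with seq 2 never arrives.
    node.deliver(Update("a", 3, {"cpu_load": 0.9}))
    node.deliver(Update("a", 1, {"cpu_load": 0.1}))
    return node.state_of("a")
```

In this sketch a receiver that misses intermediate updates still converges on the newest value it has seen, which is the relaxation the abstract describes: current state is preserved while the complete sequence of past updates is not.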
Similar resources
A New Fault Tolerant Nonlinear Model Predictive Controller Incorporating an UKF-Based Centralized Measurement Fusion Scheme
A new Fault Tolerant Controller (FTC) has been presented in this research by integrating a Fault Detection and Diagnosis (FDD) mechanism in a nonlinear model predictive controller framework. The proposed FDD utilizes a Multi-Sensor Data Fusion (MSDF) methodology to enhance its reliability and estimation accuracy. An augmented state-vector model is developed to incorporate the occurred senso...
Online Monitoring and Fault Diagnosis of Multivariate-attribute Process Mean Using Neural Networks and Discriminant Analysis Technique
In some statistical process control applications, the process data are not Normally distributed and characterized by the combination of both variable and attributes quality characteristics. Despite different methods which are proposed separately for monitoring multivariate and multi-attribute processes, only few methods are available in the literature for monitoring multivariate-attribute proce...
Fault Tolerant DNA Computing Based on Digital Microfluidic Biochips
Historically, DNA molecules have been known as the building blocks of life. In 1994, Leonard Adleman introduced a technique that uses DNA molecules for a new kind of computation. Owing to its massive parallelism, huge storage capacity, and ability to use DNA molecules inside living tissue, this type of computation is applied in many application areas such as me...
STAR: A Fault-Tolerant System for Distributed Applications
This paper presents a fault-tolerant manager for distributed applications. This manager provides an efficient recovery of hosts’ failures on networks of workstations. An independent checkpointing is used to automatically recover application processes affected by host failures. Domino-effects are avoided by means of message logging and file versions management. STAR provides an efficient softwar...
Advanced design scheme for fault tolerant distributed networked control systems
This paper addresses the integrated design of fault tolerant distributed networked control systems (NCS). The NCS under consideration consists of two levels. At the lower level, sensors, actuators and local controllers are embedded and networked by sub-nets. They are coordinated and supervised by the control stations located at the higher level. The core of the design scheme is the integrated design...
Journal title:
Volume and issue:
Pages: -
Publication date: 1996